I completed all the DataCamp exercises before proceeding to this one. I also looked up some details on how to use R Markdown more productively, so hopefully this report will be clearer than the previous ones.
Today we are looking at the Boston dataset from the MASS package, which describes housing in the suburbs of Boston along with a per capita crime rate. The variables are:
| Variables | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
# To empty the memory after the previous exercise
# rm(list = ls())
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # set working directory to source file
# Load the packages used in this exercise
library(MASS)     # Boston data, lda()
library(dplyr)    # glimpse(), select()
library(GGally)   # ggpairs()
library(corrplot) # corrplot()
library(ggord)    # ggord()
library(ggthemes) # theme_tufte(), geom_rangeframe()
library(plotly)   # plot_ly()
# Load the data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
All variables are doubles except chas and rad, which are integers. The dataset contains demographic and structural data on Boston suburbs, including tax rates and other information possibly linked to crime rates within the city.
# Pairwise scatter plots with a loess (red) and a linear (blue) smoother
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = "loess", fill = "red", color = "red", ...) +
    geom_smooth(method = "lm", fill = "blue", color = "blue", ...)
  p
}
g <- ggpairs(Boston, columns = 1:14, lower = list(continuous = my_fn))
g
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation plot gives a more focused view of the correlations, which are hard to make out in the pairs plot with this many variables.
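Since the interest here is in crime, it is also handy to pull the correlations of crim with the other variables directly out of the cor_matrix computed above. A small sketch, not part of the original analysis:

```r
# Correlations of crim with the other variables, strongest first
crim_cor <- cor_matrix["crim", ]
sort(crim_cor[names(crim_cor) != "crim"], decreasing = TRUE)
```

For this data, rad (accessibility to radial highways) and tax show the strongest positive correlations with crim.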
# Scaling the variables with mean and standard deviation with scale()
Boston_scaled <- as.data.frame(scale(Boston))
# create a quantile vector of crime and print it
bins <- quantile(Boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime' from the quantile breaks
crime <- cut(Boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- Boston_scaled[ind,]
# create test set
test <- Boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
summary(train)
## zn indus chas
## Min. :-0.48724 Min. :-1.556302 Min. :-0.27233
## 1st Qu.:-0.48724 1st Qu.:-0.875579 1st Qu.:-0.27233
## Median :-0.48724 Median :-0.210890 Median :-0.27233
## Mean : 0.01699 Mean :-0.001508 Mean : 0.02977
## 3rd Qu.: 0.04872 3rd Qu.: 1.014995 3rd Qu.:-0.27233
## Max. : 3.80047 Max. : 2.420170 Max. : 3.66477
## nox rm age
## Min. :-1.4299136 Min. :-3.876413 Min. :-2.333128
## 1st Qu.:-0.8776070 1st Qu.:-0.576252 1st Qu.:-0.834844
## Median :-0.1440749 Median :-0.113340 Median : 0.317068
## Mean :-0.0007691 Mean :-0.006532 Mean :-0.003215
## 3rd Qu.: 0.5980871 3rd Qu.: 0.498658 3rd Qu.: 0.913895
## Max. : 2.7296452 Max. : 3.551530 Max. : 1.116390
## dis rad tax
## Min. :-1.265817 Min. :-0.98187 Min. :-1.3127
## 1st Qu.:-0.821738 1st Qu.:-0.63733 1st Qu.:-0.7698
## Median :-0.263945 Median :-0.52248 Median :-0.4642
## Mean : 0.005829 Mean :-0.02017 Mean :-0.0127
## 3rd Qu.: 0.688655 3rd Qu.: 1.65960 3rd Qu.: 1.5294
## Max. : 3.956602 Max. : 1.65960 Max. : 1.7964
## ptratio black lstat
## Min. :-2.70470 Min. :-3.903331 Min. :-1.52961
## 1st Qu.:-0.67232 1st Qu.: 0.204020 1st Qu.:-0.81403
## Median : 0.18221 Median : 0.381193 Median :-0.22518
## Mean :-0.01833 Mean : 0.009689 Mean :-0.01128
## 3rd Qu.: 0.80578 3rd Qu.: 0.434619 3rd Qu.: 0.62203
## Max. : 1.63721 Max. : 0.440616 Max. : 3.54526
## medv crime
## Min. :-1.90634 low :105
## 1st Qu.:-0.62605 med_low : 99
## Median :-0.12317 med_high:103
## Mean : 0.01877 high : 97
## 3rd Qu.: 0.42320
## Max. : 2.98650
Now we have split the scaled data into a training set (80% of the observations) and a test set, using the new categorical variable “crime”, which is based on the quantiles of the original numeric crim variable; the correct classes were saved separately and removed from the test set.
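Because crime was cut at the quantiles of crim, the four classes should be roughly equal in size. A quick sanity check on the training set (a sketch reusing the train object created above):

```r
# The quantile-based cut should give four classes of roughly equal size
table(train$crime)
prop.table(table(train$crime))
```

The summary above already shows counts of 105/99/103/97, so the random split kept the classes close to balanced.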
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2450495 0.2549505 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.98181218 -0.8874203 -0.08484810 -0.8681635 0.44327317
## med_low -0.09246736 -0.2564451 0.04582045 -0.5410539 -0.14233658
## med_high -0.38649973 0.1872612 0.22458650 0.4101587 0.09019894
## high -0.48724019 1.0172187 -0.06938576 1.0532418 -0.45754612
## age dis rad tax ptratio
## low -0.9154949 0.8784211 -0.6920200 -0.7326859 -0.47479927
## med_low -0.2613505 0.3669684 -0.5480059 -0.4597182 -0.04944639
## med_high 0.4062608 -0.4030603 -0.3886824 -0.2862109 -0.27454203
## high 0.8129587 -0.8731332 1.6371072 1.5133254 0.77958792
## black lstat medv
## low 0.38334881 -0.77545046 0.53935749
## med_low 0.30759039 -0.11654502 -0.01630709
## med_high 0.08955669 0.02991653 0.15931666
## high -0.78364082 0.87959845 -0.65818815
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.108786587 0.69635890 -0.88287401
## indus -0.025195867 -0.11859408 0.33131889
## chas -0.078688136 -0.07932550 0.11513358
## nox 0.421648091 -0.74703377 -1.43562384
## rm -0.081555248 -0.08263213 -0.19402192
## age 0.258229188 -0.33591952 0.15274859
## dis -0.126865225 -0.23841496 0.37841106
## rad 2.987922913 0.96187096 -0.06539660
## tax 0.004273758 -0.08664326 0.55862554
## ptratio 0.107441591 0.03939024 -0.26014949
## black -0.136854750 -0.01716850 0.05077511
## lstat 0.236880033 -0.20362552 0.28345251
## medv 0.209732025 -0.31553234 -0.18743418
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9444 0.0412 0.0144
g <- ggord(lda.fit, train$crime, ellipse_pro = 0.95, vec_ext = 2, alpha_el = 0.3, size = 2)
g + theme_tufte() + geom_rangeframe()
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 8 0 0
## med_low 4 19 4 0
## med_high 1 9 13 0
## high 0 0 0 30
In the high-crime quartile we predicted every case correctly, but the other classes fare less well, with most of the confusion between neighbouring categories (e.g. med_high predicted as med_low). Overall, though, LDA predicts the crime class with quite good accuracy.
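The overall accuracy is the share of cases on the diagonal of the cross tabulation; for the table above this is (14 + 19 + 13 + 30) / 102 ≈ 0.75 (the exact number varies with the random train/test split). A sketch reusing the objects above:

```r
# Overall accuracy: proportion of test cases on the diagonal of the confusion table
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```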
# k-means clustering
BostonS <- scale(Boston) # standardize variables
# total within-cluster sum of squares for 1 to 15 clusters
wss <- (nrow(BostonS) - 1) * sum(apply(BostonS, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(BostonS, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
# based on the elbow in the WSS plot, run k-means with 4 clusters
km <- kmeans(BostonS, centers = 4)
# plot the Boston dataset colored by cluster assignment
pairs(BostonS, col = km$cluster)
The pairwise plots colored by cluster show groupings similar to the classes we saw in the LDA.
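To check this impression more directly, the k-means clusters can be cross-tabulated against the quantile-based crime classes created earlier (a sketch; k-means cluster numbering is arbitrary between runs, so only the pattern of the table matters, not the labels):

```r
# Cross-tabulate k-means clusters against the quantile-based crime classes
table(cluster = km$cluster, crime = crime)
```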
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Now we can examine the training observations projected onto the three linear discriminants in an interactive 3D plot.
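The projection is easier to read when the points are colored by class; plot_ly accepts a color argument for this (a sketch, assuming train still contains the crime factor in the same row order as matrix_product):

```r
# Color the 3D LDA projection by the crime class of each training observation
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)
```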